-
Notifications
You must be signed in to change notification settings - Fork 11
[New Check] Entire column of a table is not NaN #231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
for more information, see https://pre-commit.ci
nwbinspector/checks/tables.py
Outdated
| subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int)) | ||
| if np.any(~np.isnan(column[subindex_selection])): | ||
| continue | ||
| else: | ||
| yield InspectorMessage( | ||
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| subindex_selection = np.unique(np.round(np.linspace(start=0, stop=column.shape[0] - 1, num=nelems)).astype(int)) | |
| if np.any(~np.isnan(column[subindex_selection])): | |
| continue | |
| else: | |
| yield InspectorMessage( | |
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | |
| ) | |
| if np.all(np.isnan(column[:nelems])): | |
| yield InspectorMessage( | |
| message=f"Column {column.name} has all NaN values. Consider removing it from the table." | |
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's written this way in case np.any has an early return condition once it detects a single non-NaN value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.all can have the same early stopping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, interesting results...
Indeed as you say, all does indeed have early stopping just like any...
nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
all(array2)
runtime = time() - start
print(f"Took {runtime}s!")
----------------------------------
Took 0.006882190704345703s!
Took 3.719329833984375e-05s!... however numpy does not appear to (likely because it's vectorized). Maybe that gives better performance for very large in-memory arrays, but for sparse inputs the early return from Python built-ins might be nicer.
nelems = 1000000
array1 = [True for _ in range(nelems)]
array1 += [False]
array2 = [False]
array2 += [True for _ in range(nelems)]
start = time()
np.all(array1)
runtime = time() - start
print(f"Took {runtime}s!")
start = time()
np.all(array2)
runtime = time() - start
print(f"Took {runtime}s!")
----------------------------------
Took 0.05324196815490723s!
Took 0.05084705352783203s!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By sparse do you mean short?
Either approach is fine with me, either using all or np.all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess 'sparsity' (% of the number of False items being passed into all) isn't as relevant - I just mean for any case where there is at least one False element occurring somewhat early on in the input to all (as opposed to occurring later in the vector, as seen above), it will exit a few ~ms faster per check (which could add up to seconds when streaming over an entire dataset).
Anyway, I implemented the all logic here.
…awithoutborders/nwbinspector into check_table_col_not_nan
for more information, see https://pre-commit.ci
…awithoutborders/nwbinspector into check_table_col_not_nan
|
The idea is if nelems = None then it reads all the data. Could you make that true for this usage mode too? |
Done |
Co-authored-by: Ben Dichter <[email protected]>
|
how would you feel about moving this one line into the check instead of making it its own function? |
Codecov Report
@@ Coverage Diff @@
## dev #231 +/- ##
==========================================
+ Coverage 94.87% 94.93% +0.06%
==========================================
Files 17 17
Lines 936 948 +12
==========================================
+ Hits 888 900 +12
Misses 48 48
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Co-authored-by: Ben Dichter <[email protected]>
Co-authored-by: Ben Dichter <[email protected]>
Co-authored-by: Ben Dichter <[email protected]>
Yeah that's fine since it's so parred down now, the only complicated bit is the logic for determining the 'by' of the slice but that should be readable enough as is - thanks for helping make that so much simpler! |
Along with #229, finishes #155
Whenever an entire column of a table is
NaN, informs the user that there probably isn't a need to include it. Samples thenelemsuniformly across the length of the column since it's quite possible to have a block of consecutive entries (evennelemsmany) that are purposefully and even informativelyNaN.Only giving it a
SUGGESTIONlevel of importance since this will fail for large columns of sparse non-NaN, and also could be a design choice of the user.